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Abstract 

We propose a novel algorithm for greedy forward fea- 
ture selection for regularized least-squares (RLS) re- 
gression and classification, also known as the least- 
squares support vector machine or ridge regression. 
The algorithm, which we call greedy RLS, starts from 
the empty feature set, and on each iteration adds 
the feature whose addition provides the best leave- 
one-out cross-validation performance. Our method 
is considerably faster than the previously proposed 
ones, since its time complexity is linear in the number 
of training examples, the number of features in the 
original data set, and the desired size of the set of se- 
lected features. Therefore, as a side effect we obtain a 
new training algorithm for learning sparse linear RLS 
predictors which can be used for large scale learning. 
This speed is possible due to matrix calculus based 
short-cuts for leave-one-out and feature addition. We 
experimentally demonstrate the scalability of our al- 
gorithm and its ability to find good quality feature 
sets. 

1 Introduction 

Feature selection is one of the fundamental tasks in 
machine learning. Simply put, given a set of features, 
the task is to select an informative subset. The se- 
lected set can be informative in the sense that it guar- 
antees a good performance of the machine learning 
method or that it will provide new knowledge about 
the task in question. Such increased understanding 
of the data is sought after especially in life sciences, 
where feature selection is often used, for example, to 
find genes relevant to the problem under considera- 



tion. Other benefits include compressed representa- 
tions of learned predictors and faster prediction time. 
This can enable the application of learned models 
when constrained by limited memory and real-time 
response demands, as is typically the case in embed- 
ded computing, for example. 

The feature selection methods are divided by [l] 
into three classes, the so called filter, wrapper, and 
embedded methods. In the filter model feature se- 
lection is done as a preprocessing step by measur- 
ing only the intrinsic properties of the data. In the 
wrapper model [2] the features are selected through 
interaction with a machine learning method. The 
quality of selected features is evaluated by measur- 
ing some estimate of the generalization performance 
of a prediction model constructed using them. The 
embedded methods minimize an objective function 
which is a trade-off between a goodness-of-fit term 
and a penalty term for a large number of features 
(see e.g. [3]). 

On one hand, the method we propose in this work 
is a wrapper method, since equivalent results can be 
obtained by using a black-box learning algorithm to- 
gether with the considered search and performance 
estimation strategies. On the other hand it may also 
be considered as an embedded method, because the 
feature subset selection is so deeply integrated in the 
learning process, that our work results in a novel ef- 
ficient training algorithm for the method. 

As suggested by [4j , we estimate the generalization 
performance of models learned with different subsets 
of features with cross-validation. More specifically, 
we consider the leave-one-out cross-validation (LOO) 
approach, where each example in turn is left out of 
the training set and used for testing. Since the num- 



ber of possible feature combinations grows exponen- 
tially with the number of features, a search strategy 
over the power set of features is needed. The strategy 
we choose is the greedy forward selection approach. 
The method starts from the empty feature set, and on 
each iteration adds the feature whose addition pro- 
vides the best leave-one-out cross-validation perfor- 
mance. That is, it performs a steepest descent hill- 
climbing in the space of feature subsets of size up to 
a given limit k £ N and is greedy in the sense that 
it does not remove features once they have been se- 
lected. 

Our method is built upon the Regularized least- 
squares (RLS), also known as the least-squares sup- 
port vector machine (LS-SVM) and ridge regression, 
which is is a state-of-the art machine learning method 
suitable both for regression and classification [5 
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An important property of the algorithm is that it 
has a closed form solution, which can be fully ex- 
pressed in terms of matrix operations. This allows 
developing efficient computational shortcuts for the 
method, since small changes in the training data ma- 
trix correspond to low-rank changes in the learned 
predictor. Especially it makes possible the devel- 
opment of efficient cross-validation algorithms. An 
updated predictor, corresponding to a one learned 
from a training set from which one example has been 
removed, can be obtained via a well-known compu- 
tationally efficient short-cut (see e.g. 12-14]) which 



in turn enables the fast computation of LOO-based 
performance estimates. Analogously to removing the 
effect of training examples from learned RLS predic- 
tors, the effects of a feature can be added or removed 
by updating the learned RLS predictors via similar 
computational short-cuts. 

Learning a linear RLS predictor with k features 
and m training examples requires 0(min{fc 2 m, km 2 }) 
time, since the training can be performed either in 
primal or dual form depending whether m > k or 
vice versa. Given that the computation of LOO per- 
formance requires m retrainings, that the forward se- 
lection goes through of the order of 0(n) features 
available for selection in each iteration, and that the 
forward selection has k iterations, the overall time 
complexity of the forward selection with LOO crite- 
rion becomes 0(mm{k 3 m 2 n, k 2 m 3 n}) in case RLS is 



used as a black-box method. 

In machine learning and statistics literature, there 
have been several studies in which the computational 
short-cuts for LOO, as well as other types of cross- 
validation, have been used to speed up the evaluation 
of feature subset quality for ordinary non-regularized 



least-squares (see e.g 15 16 ) and for RLS (see e.g. 
I~7|[l8] ). However, the considered approaches are still 



be computationally quite demanding, since the LOO 
estimate needs to be re-calculated from scratch for 
each considered subset of features. 
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Recently, 

troduced a method which uses additional computa- 
tional short-cuts for efficient updating of the LOO 
predictions when adding new features to those al- 
ready selected. The computational cost of the incre- 
mental forward selection procedure is only 0(km 2 n) 
for the method. The speed improvement is notable 
especially in cases, where the aim is to select a large 
number of features but the training set is small. How- 
ever, the method is still impractical for large training 
sets due to its quadratic scaling. 

In this paper, we improve upon the above results 
by showing how greedy forward selection according to 
the LOO criterion can be carried out in 0{kmn) time. 
Thus, our algorithm, which we call greedy RLS or 
greedy LS-SVM, is linear in the number of examples, 
features, and selected features, while it still provides 
the same solution as the straightforward wrapper ap- 
proach. As a side effect we obtain a novel training 
algorithm for linear RLS predictors that is suitable 
for large-scale learning and guarantees sparse solu- 
tions. Computational experiments demonstrate the 
scalability and efficiency of greedy RLS, and the pre- 
dictive power of selected feature sets. 



2 Regularized Least-Squares 

We start by introducing some notation. Let ]R m and 
M. nxm , where n,m € N, denote the sets of real valued 
column vectors and n x m-matrices, respectively. To 
denote real valued matrices and vectors we use cap- 
ital letters and bold lower case letters, respectively. 
Moreover, index sets are denoted with calligraphic 
capital letters. By denoting Mj, M :J , and Mij, we 
refer to the ith row, jth column, and ?,jth entry of 
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the matrix M £ R" xm , respectively. Similarly, for 
index sets 1Z £ {1, . . . , n} and L £ {1, . . . , m}, we 
denote the submatrices of M having their rows in- 
dexed by TZ, the columns by C, and the rows by 1Z 
and columns by C as M n , M.^, and Mn,c, respec- 
tively. We use an analogous notation also for column 
vectors, that is, Vj refers to the ith entry of the vector 
v. 

Let X £ jj nxm be a matrix containing the whole 
feature representation of the examples in the training 
set, where n is the total number of features and m 
is the number of training examples. The i,jth entry 
of X contains the value of the ith feature in the jth 
training example. Moreover, let y £ E m be a vector 
containing the labels of the training examples. In 
binary classification, the labels can be restricted to 
be either — 1 or 1, for example, while they can be any 
real numbers in regression tasks. 

In this paper, we consider linear predictors of type 

/(x) = w T x 5 , (1) 

where w is the |5|-dimensional vector representation 
of the learned predictor and xs can be considered as 
a mapping of the data point x into | S | -dimensional 
feature spaceQ Note that the vector w only contains 
entries corresponding to the features indexed by S. 
The rest of the features of the data points are not 
used in prediction phase. The computational com- 
plexity of making predictions with ([I]) and the space 
complexity of the predictor are both O(fc), where k 
is the desired number of features to be selected, pro- 
vided that the feature vector representation xg for 
the data point x is given. 

Given training data and a set of feature indices 
S ', we find w by minimizing the so-called regularized 
least-squares (RLS) risk. Minimizing the RLS risk 
can be expressed as the following problem: 

argmm((w T A 5 ) T - y) T ((w T X 5 ) T - y) + Aw T w. 

w<ERl s 

(2) 

The first term in ([2]), called the empirical risk, mea- 
sures how well the prediction function fits to the 

1 la the literature, the formula of the linear predictors often 
also contain a bias term. Here, we assume that if such a bias 
is used, it will be realized by using an extra constant valued 
feature in the data points. 



training data. The second term is called the regu- 
larizer and it controls the tradeoff between the loss 
on the training set and the complexity of the predic- 
tion function. 

A straightforward approach to solve ([2| is to set 
the derivative of the objective function with respect 
to w to zero. Then, by solving it with respect to w, 
we get 

w= (X S (X S ) T + XI)- 1 X s y, (3) 

where I £ ]g> mxm i s the identity matrix. We note (see 
e.g. [20]) that an equivalent result can be obtained 
from 



v f = X s ((X s )' T X s + \I)- 1 y. 



(4) 



If the size of the set S of currently selected features is 
smaller than the number of training examples m, it is 
computationally beneficial to use the form ([3| while 
using Q is faster in the opposite case. Namely, the 
computational complexity of the former is 0( |<S| 3 + 
|<S| 2 m), while the that of the latter is (9(m 3 + m 2 \S\), 
and hence the complexity of training a predictor is 
0(mm{|S| 2 TO,7n 2 |S|}). 

To support the considerations in the following sec- 
tions, we introduce the dual form of the prediction 
function and some extra notation. According to [7j, 
the prediction function ([I]) can be represented in dual 
form as follows 

/(x)=a T (X 5 ) T x 5 . 

Here a £ M. m is the vector of so-called dual variables, 
which can be obtained from 



where 



and 



a = Gy, 



G=(K + Xiy 



K = (XsfX s . 



(5) 



(6) 



Note that if S = 0, K is defined to be a matrix whose 
every entry is equal to zero. 

In the context of kernel-based learning algorithms 
(see e.g. 21 - 23 ), if and a are usually called the ker- 
nel matrix and the vector of dual variables, respec- 
tively. The kernel matrix contains the inner prod- 
ucts between the feature vectors of all training data 
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points. With kernel methods, the feature vector rep- 
resentation of the data points may be even infinite 
dimensional as long as there is an efficient, although 
possibly implicit, way to calculate the value of the 
inner product and the dual variables are used to rep- 
resent the prediction function. However, while the 
dual representation plays an important role in the 
considerations of this paper, we restrict our consid- 
erations to feature representations of type X5 which 
can be represented explicitly. 

Now let us consider some well-known efficient ap- 
proaches for evaluating the LOO performance of a 
trained RLS predictor (see e.g. 12]-T4p4 and for the 
non-regularized case, see e.g 15 .16|). The LOO pre- 
diction for the jth training example can be obtained 
in constant time from 



where 
and 



(i-qj-r'te-w). 



f = (w T A s ) T 



(Xs tj ) T (Xs(X s ) T + xi)- 1 x s ^ 



(7) 



provided that the vectors f and q are computed and 
stored in memory. The time complexity of computing 
f and q is O ( |<S| 3 + |6>| 2 to), which is the same as that 
of training RLS via ([3|. Alternatively, the constant 
time LOO predictions can be expressed in terms of 
the dual variables a and the diagonal entries of the 
matrix G as follows: 



(8) 



Here, the time complexity of computing a and the 



diagonal entries of the matrix G is 0(r 



\S\), 



which is equal to the complexity of training RLS via 
Q. Thus, using cither jfj or the LOO perfor- 
mance for the whole training set can be computed 



in 0(min{|S| z 



|5|}) time, which is equal to the 



time complexity of training RLS. 



3 Feature Selection 

In this section, we consider algorithmic implementa- 
tions of feature selection for RLS with leave-one-out 



(LOO) criterion. We start by considering a straight- 
forward wrapper approach which uses RLS as a black- 
box method in Section |3.1| Next, we recall a previ- 
ously proposed algorithm which provides results that 
are equivalent to those of the wrapper approach but 
which has computational short-cuts to speed up the 
feature selection in Section 3.2 In Section 3.3 we 



present greedy RLS, our novel linear time algorithm, 
which again provides the same results as the wrap- 
per approach but which has much lower computa- 
tional and memory complexity than the previously 
proposed algorithms. 

3.1 Wrapper Approach 



Algorithm 1: Standard wrapper algorithm for 
RLS 

Input: X e R nxm , y € R m , k, A 

Output: S, w 

1 5^0; 

2 while |<S| < k do 

3 e <— 00; 

4 6^0; 

5 foreach i E {1, . . . , n} \ S do 

6 K<-S{j{t}; 

7 e t <- 0; 

8 foreach j E {1, . . . , m} do 

9 £<-{l,...,m}\{j}; 

10 w 4— t(X-ji£,yc, A); 

11 p^w T X n , J ; 

12 ei 4- ei + l(yj,p); 

13 end 

14 if ei < e then 

15 e <— ei; 

16 b <— i; 

17 end 
is end 

19 S^SU{b}; 

20 end 

21 w <- t(X s ,y, A); 



Here, we consider the standard wrapper approach 
in which RLS is used as a black-box method which is 
re-trained for every feature set to be evaluated during 
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the selection process and for every round of the LOO 
cross-validation. In forward selection, the set of all 
feature sets up to size k is searched using hill climb- 
ing in the steepest descent direction (see e.g. [25] ) 
starting from an empty set. The method is greedy in 
the sense that it adds one feature at a time to the set 
of selected features but no features are removed from 
the set at any stage. The wrapper approach is pre- 
sented in Algorithm [l] In the algorithm description, 
t(-, ■, •) denotes the black-box training procedure for 
RLS which takes a data matrix, a label vector, and a 
value of the regularization parameter as input and re- 
turns a vector representation of the learned predictor 
w. The function computes the loss for one train- 
ing example given its true label and a predicted label 
obtained from the LOO procedure. Typical losses 
are, for example, the squared loss for regression and 
zero-one error for classification. 

Given that the computation of LOO performance 
requires m retrainings, that the forward selection 
goes through 0{n) features in each iteration, and 
that k features are chosen, the overall time com- 
plexity of the forward selection with LOO criterion 
is 0(min{fc 3 m 2 n, k 2 m 3 n}). Thus, the wrapper ap- 
proach is feasible with small training sets and in cases 
where the aim is to select only a small number of 
features. However, at the very least quadratic com- 
plexity with respect to both the size of the train- 
ing set and the number of selected features make the 
standard wrapper approach impractical in large scale 
learning settings. The space complexity of the wrap- 
per approach is 0{nm) which is equal to the cost of 
storing the data matrix X. 

An immediate reduction for the above considered 
computational complexities can be achieved via the 
short-cut methods for calculating the LOO perfor- 
mance presented in Section [2] In machine learning 
and statistics literature, there have been several stud- 
ies in which the computational short-cuts for LOO, as 
well as other types of cross-validation, have been used 
to speed up the evaluation of feature subset quality 



ing the predictor itself. However, both 18 and 17 



for non-regularized least-squares (see e.g 15 16]) and 
for RLS (see e.g. 17 18]). For the greedy forward 



selection, the short-cuts reduce the overall compu- 
tational complexity to 0(min{fc 3 



considered only the dual formulation (|8j) of the LOO 
speed-up which is expensive for large data sets. Thus, 
we are not aware of any studies in which the primal 
formulation Q would be implemented for the greedy 
forward selection for RLS. 



3.2 Previously Proposed Speed-Up 

We now present a feature selection algorithm pro- 



posed by 19 which the authors call low-rank updated 



LS-SVM. The features selected by the algorithm are 
equal to those by standard wrapper algorithm and by 
the method proposed by [17] but the method has a 
lower time complexity with respect to k due to cer- 
tain matrix calculus based computational short-cuts. 
The low-rank updated LS-SVM algorithm for feature 
selection is presented in Algorithm [2] 

The low-rank updated LS-SVM is based on updat- 
ing the matrix G and the vector a as new features 
are added into S. In the beginning of the forward 
update algorithm, the set S of selected features is 
empty. This means that the matrix G contains no 
information of the data matrix X. Instead, every en- 
try of the matrix K is equal to zero, and hence G 
and a are initialized to A -1 / and A _1 y, respectively, 
in lines [2] and [3] in Algorithm [2] When a new feature 
candidate is evaluated, the LOO performance corre- 
sponding to the updated feature set can be calculated 
using the temporarily updated G and a, denoted in 
the algorithm as G and a, respectively. Here, the 
improved speed is based on two short-cuts. Namely, 
updating the matrix G via the well known Sherman- 
Morrison- Woodbury (SMW) formula and using the 
fast LOO computation. 

Let us first consider updating the matrix G. For- 
mally, if the iih feature is to be evaluated, where 
i £ S, the algorithm calculates the matrix G corre- 
sponding the feature index set S + {i}. According to 
5k, the matrix G is 



(K 



xiy 



(9) 



, k 2 m 2 n}), since 



LOO can then be calculated as efficiently as train- 



where K is the kernel matrix defined as in ([6]) without 
having the «th feature in S and v = (Xi) contains 
the values of the ith feature for the training data. 
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Input: X eR nx " 
Output: S, w 

1 5^0; 

2 a <— A _1 y; 

3 G<- A" 1 /; 

4 while 151 < k do 



ye 



k, X 



5 

6 

7 

8 

9 
10 
11 
12 
13 
14 
15 
16 
17 
18 
19 
20 
21 
22 
23 
24 

25 end 

26 W(- 



n} \ S do 

r v T Gv)- 1 (v T G); 



e <— oo; 
6^0; 

foreach i e {1,. .. 

v <- (X ( ) T ; 
G <- G - Gv(l 
a <- Gy; 
e, <- 0; 

foreach j e {1, . . . , to} do 

.V.. (^.i) ' a ; : 

e, 4- ej + l(yj,p); 
end 

if ej < e then 

e «- e,; 
6 «- i: 



end 



end 

v < 
G 
a < 
S ■ 



- G - Gv(l 
Gy; 

SU{b}; 



v T Gv)- 1 (v T G); 



X s a: 



The time needed to compute ^ is cubic in m, which 
provides no benefit compared to the standard wrap- 
per approach. However, provided that the matrix G 
is stored in memory, it is possible to speed up the 
computation of G via the SMW formula. Formally, 
the matrix G can also be obtained from 



Algorithm 2: Low-rank updated LS-SVM 



G-Gv(l + v T Gv)- 1 (v T G) 



(10) 



which requires a time quadratic in m provided that 
the multiplications are performed in the order indi- 
cated by the parentheses. Having calculated G, the 
updated vector of dual variables a can be obtained 
from 

Gy (11) 

also with 0(m 2 ) floating point operations. 

Now let us consider the efficient evaluation of the 
features with LOO performance. Provided that G 
and a are already stored in memory, the LOO per- 
formance of RLS trained with the S + {i} training 
examples can be calculated via After the fea- 
ture, whose inclusion would be the most favorable 
with respect to the LOO performance, is found, it is 
added to the set of selected features and RLS solu- 
tion is updated accordingly in the last four lines of 
the algorithm. 

The outermost loop is run k times, k denoting the 
number of features to be selected. For each round of 
the outer loop, the middle loop is run 0(n) times, 
because each of the n — \S\ features is evaluated in 
turn before the best of them is added to the set of se- 
lected features S. Due to the efficient formula Q for 
LOO computation, the calculation of the LOO pre- 
dictions for the to requires only 0{m) floating point 
operations in each run of the innermost loop. The 
complexity of the middle loop is dominated by the 
0(m 2 ) time needed for calculating G in line [9] and a in 
line[l0j and hence the overall time complexity of the 
low-rank updated LS-SVM algorithm is 0(knm 2 ). 

The complexity with respect to k is only linear 
which is much better than that of the standard wrap- 
per approach, and hence the selection of large feature 
sets is made possible. However, feature selection with 
large training sets is still infeasible because of the 
quadratic complexity with respect to to. In fact, the 
running time of the standard wrapper approach with 
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the LOO short-cut can be shorter than that of the 
low-rank updated LS-SVM method in cases where 
the training set is large and the number of features 
to be selected is small. 

The space complexity of the low-rank updated LS- 
SVM algorithm is 0(nm + m 2 ), because the matrices 
X and G require 0(nm) and 0(m 2 ) space, respec- 
tively. Due to the quadratic dependence of m, this 
space complexity is worse than that of the standard 
wrapper approach. 

3.3 Linear Time Forward Selection 
Algorithm 

Here, we present our novel algorithm for greedy for- 
ward selection for RLS with LOO criterion. We refer 
to our algorithm as greedy RLS, since in addition to 
feature selection point of view, it can also be consid- 
ered as a greedy algorithm for learning sparse RLS 
predictors. Our method resembles the low-rank up- 
dated LS-SVM in the sense that it also operates by 
iteratively updating the vector of dual variables, and 
hence we define it on the grounds of the considera- 
Pscudo code of greedy RLS is 



tions of Section 3.2 



presented in Algorithm [3] 

The time and space complexities of the low-rank 
updated LS-SVM algorithm are quadratic in m, be- 
cause it involves updating the matrix G (lines [9] and 
22 in Algorithm [2]) and the vector a (lines [To] and [23] 
in Algorithm [2]). In order to improve the efficiency of 
the algorithm by getting rid of the quadratic depen- 
dence of m in both space and time, we must avoid 
the explicit calculation and storing of the matrices 
G and G. Next, we present an improvement which 
solves these problems. 

In the middle loop of the low-rank updated LS- 
SVM algorithm in Algorithm [2j it is assumed that 
the matrix G constructed according to the current 
set S of selected features is stored in memory. It is 
temporarily updated to matrix G corresponding to 
the set iS + {i}, where i is the index of the feature 
to be evaluated. We observe that the LOO compu- 
tations via the formula Q require the vector of dual 
variables a corresponding to S + {i} but only the 
diagonal elements of G. 



Let us first consider speeding up the computation 
of a. Since we already have a stored in memory, we 
can take advantage of it when computing a. That 
is, since the value of the vector a before the update 



is Gy, its updated value a can be, according to (10 1 



and (11), obtained from 



a - Gv(l + v T Gv)- 1 (v T a), 



(12) 



where v = (A i ) T . This does not yet help us much, 



because the form ( 12 ) still contains the matrix G ex- 
plicitly. However, if we assume that, in addition to 
a, the product Gv is calculated in advance and the 



result is stored in memory, calculating (12) only re- 



quires 0(m) time. In fact, since the operation (12) is 



performed for almost every feature during each round 
of the greedy forward selection, we assume that we 
have an m x n cache matrix, denoted by 



G = GX T , 



(13) 



computed in advance containing the required prod- 
ucts for all features, each column corresponding to 
one product. 

Now, given that G is available in memory, we cal- 
culate, in 0{m) time, a vector 



G„M • v' G,,; ; 



(14) 



and store it in memory for later usage. This is done 
in lines [10] and [24] of the algorithm in Algorithm [3] 
Consequently, (12) can be rewritten as 



a — u(v T a). 



(15) 



and hence the vector a can be updated in 0(m) time. 

We also have to ensure that fast computation of the 
LOO performance. The use of ([8| for computing the 
LOO for the updated feature set requires the diagonal 
elements of the matrix G. Let d and d denote the 
vectors containing the diagonal elements of G and G, 



respectively. We observe that the SMW formula ( 10 ) 
for calculating G can be rewritten as 



G-u(G : ,^ 



(16) 



Now, we make an assumption that, instead of having 
the whole G being stored in memory, we have only 
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stored its diagonal elements, that is, the vector d. 



Then, according to ( 16 ) , the jth element of the vector 



d can be computed from 



3,1- 



(17) 



requiring only a constant time, the overall time 
needed for computing d again becoming 0(m). 

Thus, provided that we have all the necessary 
caches available, evaluating each feature requires 
0(m) time, and hence one pass through the whole 
set of n features needs 0{mn) floating point opera- 
tions. But we still have to ensure that the caches can 
be initialized and updated efficiently enough. 

In the initialization phase of the greedy RLS algo- 
rithm (lines Ij4 in Algorithm [3| the set of selected 
features is empty, and hence the values of a, d, and 
C are initialized to A _1 y, A _1 l, and \~ 1 X T , respec- 
tively, where 1 E K m is a vector having every entry 
equal to 1. The computational complexity of the ini- 
tialization phase is dominated by the O(mn) time 
required for initializing C. Thus, the initialization 
phase is no more complex than one pass through the 
features. 

When a new feature is added into the set of se- 
lected features, the vector a is updated according to 
(151 and the vector d according to (17). Putting to- 



Algorithm 3: Greedy RLS algorithm proposed 
by us 



Input: X e R nx " 
Output: S, w 

1 a A _1 y; 

2 d «- A _1 1; 

3 C «- \~ 1 X T ; 

4 5^0; 

5 while 151 < k do 



. y e 



k, A 



gether (13), (14), and (16), the cache matrix C can 
be updated via 

C-u(v T C), 

which requires 0(mn) time. This is equally expensive 
as the above introduced fast approach for trying each 
feature at a time using LOO as a selection criterion. 

Finally, if wc are to select altogether k features, 
the overall time complexity of greedy RLS becomes 
0(kmn). The space complexity is dominated by the 
matrices X and C which both require 0{mn) space. 

4 Experimental results Sl cnd 



6 


e 


<— 


00; 


7 


b 




0; 


8 


foreach i E {1, . . . , n\ \ S 


9 




V 


<- (^) T ; 


10 




m-CUi + vCi)- 1 ; 


11 




a 


<— a u(v T a); 


12 




e 




13 




foreach j E {1, . . . , m} 


14 






dj <- d , u ,(•,.,: 


15 






pi-yj- (dj)- x a i; 


16 






e t <- a + l(yj,p); 


17 




end 


18 




if e, < e then 


19 






e <- e f, 


20 






b-h- i; 


21 




end 


22 


end 




23 


V 


<— 




24 


u <^ 


a b (i+v T a b )- 1 -, 


25 


a <- 


a — u(v T a); 


26 


foreach j E {1, . . . , m} do 


27 




d 




28 


end 




29 


C <- 


C-u(v T C); 


30 


S <- 


SU{b}; 



In Section |4.1| we explore the computational effi- 
ciency of the introduced method and compare it with 
the best previously proposed implementation. In Scc- 



X s a; 



tion 4.2 we explore the quality of the feature selec- 
tion process. In addition, we conduct measurements 



8 



of the degree to which the LOO criterion overfits in 
Section IO 



4.1 Computational efficiency 

First, we present experimental results about the scal- 
ability of our method. We use randomly generated 
data from two normal distributions with 1000 fea- 
tures of which 50 are selected. The number of ex- 
amples is varied. In the first experiment we vary the 
number of training examples between 500 and 5000, 
and provide a comparison to the low-rank updated 
LS-SVM method 
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In the second experiment the 
training set size is varied between 1000 and 50000. 
We do not consider the baseline method in the second 
experiment, as it does not scale up to the considered 
training set sizes. The runtime experiments were run 
on a modern desktop computer with 2.4 GHz Intel 
Core 2 Duo E6600 processor, 8 GB of main memory, 
and 64-bit Ubuntu Linux 9.10 operating system. 

We note that the running times of these two meth- 
ods are not affected by the choice of the regularization 
parameter, or the distribution of the features or the 
class labels. This is in contrast to iterative optimiza- 
tion techniques commonly used to train for example 
support vector machines 26 . Thus we can draw gen- 



eral conclusions about the scalability of the methods 
from experiments with a fixed value for the regular- 
ization parameter, and synthetic data. 

In Fig. [T] are the runtime comparisons with lin- 
ear scaling on y-axis, and in Fig. [2] with logarith- 
mic scaling. The results are consistent with the al- 
gorithmic complexity analysis of the methods. The 



method of 19 shows quadratic scaling with respect 



to the number of training examples, while the pro- 
posed method scales linearly. In Fig. [3] are the run- 
ning times for the proposed method for up to 50000 
training examples, in which case selecting 50 features 
out of 1000 took a bit less than twelve minutes. 



4.2 Quality of selected features 

In this section, we explore the quality of the feature 
selection process. We recall that our greedy RLS al- 
gorithm leads to equivalent results as the algorithms 



— ♦ greedy RLS 

- m low-rank updated LS-SVM 



500 1000 1500 2000 2500 3000 3500 4000 4500 5000 
training set size 



Figure 1: Running times in CPU seconds for the pro- 
posed greedy RLS method and the and the low-rank 
updated LS-SVM of [19]. Linear scaling on y-axis. 
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» -m low-rank updated LS-SVM 
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Figure 2: Running times in CPU seconds for the pro- 
posed greedy RLS method and the and the low-rank 
updated LS-SVM of [T9J Logarithmic scaling on y- 
axis. 
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Table 1: Data sets 



20000 30000 40000 

training set size 



Figure 3: Running times in CPU seconds for the pro- 
posed greedy RLS method. 



proposed by 17 and by 19 , while being computa- 



tionally much more efficient. The aforementioned au- 
thors have in their work shown that the approach 
compares favorably to other commonly used feature 
selection techniques. However, due to computational 
constraints the earlier evaluations have been run on 
small data sets consisting only of tens or hundreds 
of examples. We extend the evaluation to large scale 
learning with the largest of the considered data sets 
consisting of more than hundred thousand training 
examples. Wrapper selection methods do not typ- 
ically scale to such data set sizes, especially when 
using cross-validation as a selection criterion. Thus 
we consider as a baseline an approach which chooses 
k features at random. This is a good sanity-check, 
since training RLS with this approach requires only 
0(min(fc 2 77i, km?)) time that is even less than the 
time required by greedy RLS. 

We consider the binary classification setting, the 
quality of a selected feature set is measured by the 
classification accuracy of a model trained on these 
features, evaluated on independent test data. The 
experiments are run on a number of publicly avail- 
able benchmark data sets, whose characteristics are 
summarized in Table [l] The mnist dataset is orig- 
inally a multi-class digit recognition task, here we 



data set 


^instances 


^features 


adult 


32561 


123 


australian 


683 


14 


colon-cancer 


62 


2000 


german.numer 


1000 


24 


icjnnl 


141691 


22 


mnist 5 


70000 


780 



consider the binary classification task of recognizing 
the digit 5. 

All the presented results are average accuracies 
over stratified tenfold-cross validation run on the full 
data sets. On each of the ten cross-validation rounds, 
before the feature selection experiment is run we se- 
lect the value of the regularization parameter as fol- 
lows. We train the learner on the training folds us- 
ing the full feature set, and perform a grid search 
to choose a suitable regularization parameter value 
based on leave-one-out performance. 

Using the chosen regularization parameter value, 
we begin the incremental feature selection process on 
the training folds. One at a time we select the fea- 
ture, whose addition to the model provides highest 
LOO accuracy estimate on the training folds. Each 
time a feature has been selected, the generalization 
performance of the updated model is evaluated on 
the test fold. The process is carried on until all the 
features have been selected. 

In the presented figures we plot the number of 
selected features against the accuracy achieved on 
the test fold, averaged over the ten cross-validation 
rounds. In Fig. [3] are the results for the adult, in 
Fig. [5] for the australian, in Fig. [6] for colon-cancer, 
in Fig. [7] for german.numer, in Fig. [8] for ijcnnl, and 
in Figure [9] for the mnist5 data set. 

On all data sets, the proposed approach clearly 
outperforms the random selection strategy, suggest- 
ing that the leave-one-out criterion leads to selecting 
informative features. In most cases with only a small 
subset of features one can achieve as good perfor- 
mance as with all the features. 
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Figure 4: Performance on the adult data set. Figure 6: Performance on the colon-cancer data set. 
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Figure 5: Performance on the australian data set. 



Figure 7: Performance on the german.numer data 
set. 
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Figure 10: Test vs. LOO accuracy on the adult data 
set. 



Figure 8: Performance on the ijcnnl data set. 



4.3 Overfitting in feature selection 
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Figure 9: Performance on the mnist5 data set. 



One concern with using the LOO estimate for select- 
ing features is that the estimate may overfit to the 
training data. If one considers large enough set of 
features, it is quite likely that some features will by 
random chance lead to improved LOO estimate, while 
not generalizing outside the training set. We explore 
to what extent and in which settings this is a prob- 
lem by comparing the LOO and test accuracies in the 
previously introduced experimental setting. Again, 
we plot the number of selected features against the 
accuracy estimates. 

In most cases the LOO and test accuracies are close 



to identical (Figures 10 11 14 and 15). However, on 



the colon-cancer and german.numer data sets (Fig- 
ures 12 and 13) we see evidence of overfitting, with 



LOO providing over-optimistic evaluation of perfor- 
mance. The effect is especially clear on the colon- 
cancer data set, which has 2000 features, and only 62 
examples. The results suggest that reliable feature se- 
lection can be problematic on small high-dimensional 
data sets, the type of which are often considered for 
example in bioinformatics. On larger data sets the 
LOO estimate is quite reliable. 
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Figure 11: Test vs. LOO accuracy on the australian Figure 13: Test vs. LOO accuracy on the ger- 
data set. man.numer data set. 
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Figure 12: Test vs. LOO accuracy on the colon- Figure 14: Test vs. LOO accuracy on the ijcnnl data 
cancer data set. set. 
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Figure 15: Test vs. LOO accuracy on the mnist5 
data set. 



5 Discussion and Future Direc- 
tions 

In this paper, we present greedy RLS, a linear time al- 
gorithm for feature selection for RLS. The algorithm 
uses LOO as a criterion for evaluating the goodness 
of the feature subsets and greedy forward selection as 
a search strategy in the set of possible feature sets. 
The proposed algorithm opens several directions for 
future research. In this section, we will briefly dis- 
cuss some of the directions which we consider to be 
the most promising ones. 

Greedy RLS can quite straightforwardly be gener- 
alized to use different types of cross-validation cri- 
teria, such as n-fold or repeated n-fold. These are 
motivated by their smaller variance compared to the 
leave-one-out and they have also been shown to have 
better asymptotic convergence properties for feature 

This 



subset selection for ordinary least-squares 15 



generalization can be achieved by using the short-cut 



methods developed by us 27 and by 28 



In this paper, we focus on the greedy forward 
selection approach but an analogous method can 
also be constructed for backward elimination. How- 
ever, backward elimination would be computation- 
ally much more expensive than forward selection, be- 



cause an RLS predictor has to be trained first with 
the whole feature set. In the feature selection litera- 
ture, approaches which combine both forward and 
backward steps, such as the floating search meth- 
ods (see e.g. 29 30]), have also been proposed. Re- 



cently, [31] considered a modification of the forward 
selection for least-squares, which performs corrective 
steps instead of greedily adding a new feature in each 
iteration. The considered learning algorithm was or- 
dinary least-squares and the feature subset selection 
criterion was empirical risk. The modified approach 
was shown to have approximately the same computa- 
tional complexity (in number of iterations) but better 
performance than greedy forward selection or back- 
ward elimination. A rigorous theoretical justification 
was also presented for the modification. While this 
paper concentrates on the algorithmic implementa- 
tion of the feature subset selection for RLS, we aim 
to investigate the performance of this type of modi- 
fications in our future work. 

The computational short-cuts presented in this pa- 
per are possible due to the closed form solution the 
RLS learning algorithms has. Such short-cuts are also 
available for variations of RLS such as RankRLS, an 
RLS based algorithm for learning to rank proposed 
by us in 32 33 . In our future work, we also plan to 



design and implement similar feature selection algo- 
rithms for RankRLS. 

Analogously to the feature selection methods, 
many approaches has been developed also for so- 
called reduced set selection used in context of kernel- 



based learning algorithms (see e.g 10 34 and refer- 
ences therein). For example, incremental approaches 
for selecting this set for regularized least-squares has 
been developed by 



35 



Closely related to the re- 
duced set methods, we can also mention the methods 
traditionally used for selecting centers for radial basis 
function networks. Namely, the algorithms based on 



the orthogonal least-squares method proposed by 36 
for which efficient forward selection methods are also 
known. In the future, we plan to investigate how well 
approaches similar to our feature selection algorithm 
could perform on the tasks of reduced set or center 
selection. 
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6 Conclusion 

We propose greedy regularized least-squares, a novel 
training algorithm for sparse linear predictors. The 
predictors learned by the algorithm are equivalent 
with those obtained by performing a greedy forward 
feature selection with leave-one-out (LOO) criterion 
for regularized least-squares (RLS), also known as the 
least-squares support vector machine or ridge regres- 
sion. That is, the algorithm works like a wrapper 
type of feature selection method which starts from 
the empty feature set, and on each iteration adds 
the feature whose addition provides the best LOO 
performance. Training a predictor with greedy RLS 
requires 0(kmn) time, where k is the number of non- 
zero entries in the predictor, m is the number of 
training examples, and n is the original number of 
features in the data. This is in contrast to the com- 
putational complexity 0(min{/c 3 m 2 n, k 2 m 3 n}) of us- 
ing the standard wrapper method with LOO selection 
criterion in case RLS is used as a black-box method, 
and the complexity 0(km 2 n) of the method proposed 
by |19| , which is the most efficient of the previously 
proposed speed-ups. Hereby, greedy RLS is compu- 
tationally more efficient than the previously known 
feature selection methods for RLS. 

We demonstrate experimentally the computational 
efficiency of greedy RLS compared to the best previ- 
ously proposed implementation. In addition, we ex- 
plore the quality of the feature selection process and 
measure the degree to which the LOO criterion over- 
fits with different sizes of data sets. 

We have made freely available a software package 
called RLScore out of our previously proposed RLS 
based machine learning algorithm^ An implemen- 
tation of the greedy RLS algorithm will also be made 
available as a part of this software. 
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